Prosody-based search features in information retrieval
Abstract
Massive amounts of digital audio material are stored in databases to be accessed via digital networks. A major challenge is how to organise and index this material to best support retrieval applications. Not enough manpower will ever be available to index the terabytes of digital material by hand. Methods for interpreting the complex data automatically, or at least semi-automatically, must therefore be found. Valuable information in the form of prosodic features can be automatically extracted from the speech signal; computation of these features significantly enhances the automatic interpretation of, especially, the emotional content of a recording or audio file. In this paper, ways of utilizing prosodic or acoustic features of speech to develop retrieval applications are discussed, and the MediaTeam Emotional Speech Corpus is introduced.

Information retrieval

The current stages of the digital revolution – in particular, the Internet/World Wide Web and the new ways of storing and retrieving information – have produced very large collections of digital data (for example, text, audio and video libraries). Such collections are nowadays an essential part of a properly functioning information society. There is thus a clear need for specific methods for effectively browsing and searching such digital libraries. From the viewpoint of the end user, the problem is, of course, how to find the required information in numerous, and increasingly large, databases. The problem concerns mainly the currently booming media types, such as digital speech, music and images, where the search criteria often include semantic concepts.

The current demand for automatic interpretation of digital data has resulted in intensive research on content-based retrieval. The central problem is how to narrow the semantic gap between the concept-based and the content-based approaches to data indexing. That is, the user of an information retrieval application is interested in an (abstract) idea or a concept – for example, when searching a digital video library, he/she might want to find a certain impressive scene in an action movie (yet the person may remember the scene only vaguely). However, the search engine cannot, as yet, interpret the data at such an abstract level: information retrieval applications do not “think” or “reason” the same way as people do. The search engine is not able to transform the user’s vague query language into a computer-interpretable description language of the data. From the viewpoint of research, then, the ambition is to close the gap between the natural semantic concepts used by a person seeking information and the computer-interpretable database descriptors automatically derived from the data contents.

Information retrieval applications are based on an understanding of the content of the file(s), and there are, basically, two approaches. On the one hand, human interpretation can be utilized to generate semantic keywords, but this tends to be inconsistent over time and is very expensive. On the other hand, automatic interpretation of data can be developed: it is cheap, replicable and scalable. The automatic interpretation may be wrong, but it is consistently so: the search results, whether good or bad, are at least systematic. As for the development of search engines, a critical fact is thus that the digital databases are so massive: organizing and indexing the databases is very demanding, and can never be done manually.
Therefore, at least semi-automatic means for interpreting complex data must be developed. In some cases, textual meta-data is available from which keywords can be extracted for a specific (video) recording (for example, film producers may include scene scripts in the digital movie product). Talk shows, interviews and debates stored in digital databases occasionally contain such scripts, but this is by no means the rule.

Compared with information retrieval for text media, semantically rich content-based information retrieval for audio (and video) is still an unattained goal (Bregman, 1990; Smeaton, 2000). With modern text mining methods, it is possible to automatically categorize and summarize very large documents, and to locate and extract the relevant textual information. Image, video and audio are so rich in content that such generic information retrieval applications are still far away. However, some progress has already been made. The development of automatic speech recognition is a partial solution to the problem of information retrieval from large and semantically rich (audio) databases: it enables the computation of topic-related keywords, thus locating those specific parts of the file where the interesting and relevant words occur. Speech retrieval applications are all based on some kind of recognition (where phone recognition precedes word recognition). Full speech recognition transcribes spoken utterances into text, which can be analyzed to achieve a syntactic and semantic description of the utterances, whereas single-word recognition only locates the relevant “spots” in the spoken messages. Full speech recognition is speaker-dependent and expensive, and requires training. Word-spotting applications are often sufficient, and they work because of the pre-defined vocabulary (the search is for these words only).

Speech signal interpretation can be enhanced by means of prosody-based search features: general prosodic features of speech can be utilized to automatically chart the distribution of cohesive paragraphs, or “paratones”, in the speech data. Prosodic features such as F0 and intensity level can be used for topic and phrase boundary identification (Swerts & Ostendorf, 1997). The speech can be segmented into audio paragraphs/paratones using acoustic/prosodic information, after which automatic speech recognition can be applied to each paratone. By means of these more intelligent search features, the user of the information retrieval application could find all those lexically/syntactically relevant, and prosodically cohesive, parts of the audio which deal with a certain pre-defined topic.

As for audio databases, it is usually the case that the search engine can utilize only very general keywords, such as the topic/subject matter of the discussion/recording, the personal data of the speakers, and the time and place of the recording. The search engine can certainly locate a number of useful candidate audio files on the basis of this information, but it is clear that the search cannot extract the deeper semantic content of the speech data. Even if keywords describing the main topic of the stored conversation/debate/interview can be spotted by means of automatic speech recognition, much of the content of the speech situation remains undescribed. Here specific acoustic/prosodic information can be of use.
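As an illustration of the paratone segmentation idea described above, the following sketch marks candidate topic boundaries wherever a long pause is followed by an F0 reset (a jump back to a high pitch onset). This is a minimal sketch under stated assumptions: the frame-level F0 track and voicing decisions are taken as given, and the function name, frame rate and thresholds are illustrative choices, not part of the system described in the paper.

```python
import numpy as np

def paratone_boundaries(f0_hz, voiced, frame_rate=100.0,
                        min_pause_s=0.6, reset_semitones=3.0):
    """Mark candidate paratone (topic) boundaries in a frame-level F0 track.

    f0_hz  : F0 value per frame (ignored where unvoiced)
    voiced : boolean array, True where the frame is voiced
    A boundary is hypothesized where an unvoiced stretch (pause) of at least
    `min_pause_s` is followed by an F0 onset that lies at least
    `reset_semitones` above the F0 level just before the pause.
    """
    f0 = np.where(voiced, f0_hz, np.nan)
    st = 12.0 * np.log2(f0 / 100.0)            # F0 in semitones re 100 Hz
    min_pause = int(min_pause_s * frame_rate)

    boundaries = []
    i, n = 0, len(f0)
    while i < n:
        if not voiced[i]:
            start = i
            while i < n and not voiced[i]:
                i += 1
            if (i - start) >= min_pause and start > 0 and i < n:
                before = np.nanmean(st[max(0, start - 20):start])   # ~200 ms before pause
                after = np.nanmean(st[i:min(n, i + 20)])             # ~200 ms after pause
                if np.isfinite(before) and np.isfinite(after) and after - before >= reset_semitones:
                    boundaries.append(i / frame_rate)                # boundary time in seconds
        else:
            i += 1
    return boundaries
```

In practice the pause-length and F0-reset thresholds would have to be tuned per speaker and recording condition, and intensity could be added as a further cue, as the paper suggests.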
The user of an information retrieval application might be interested in finding those parts of an audio file where something “interesting” happens (for example, where the speakers sound very excited or annoyed). Thus, to radically improve the current content-based information retrieval methods, we need to add prosody-based elements to the list of search features. Prosody, in some subtle ways, signals not only the textual structure of the discourse (the beginnings and endings of topics) and turn-taking but also the emotional/affective state of the speakers. Speech is, in actual fact, quite rich in different kinds of indexical markers relating, for example, to the affective/medical status or the socio-economic/professional background of the speaker (Laver, 1994). These features are, to some extent at least, speaker- and language-independent (Toivanen, 2001). This kind of (prosodic) information could be valuable for database indexing and retrieval of audio material. Currently available search engines cannot make use of the acoustic parameters underlying an “angry” voice, for example, but more intelligent systems could be designed to utilize such acoustic information.

Emotional speech databases

The computer, or the search engine, cannot evaluate the emotional content of spoken messages on the basis of prosodic/acoustic information unless it is explicitly taught the way in which human listeners perceive emotions in speech. That is, in order to find out the quantifiable relationship between prosodic features of speech and perceived emotions, listening experiments and subjective evaluations are needed.

Studies of emotional speech features are often based on recordings of actors simulating a number of emotions in a studio environment. Usually, the researchers then have a panel of listeners label the emotional content of each speech sample. Typically, the speech consists of a few standard phrases or sentences, the lexical content remaining the same for each simulated emotion. If the listener-judges can “hear” the intended emotions, the recordings can be considered representative of the affective states under investigation (Iida et al., 1998). Because the speech material is identical, a detailed study of the acoustic/prosodic structure is justified: as the lexical content does not vary, the (spectral) differences between the speech samples should be caused by the intended emotional state, not by the sound structure. The listeners may indeed identify the intended emotion while being fully aware that the emotion is simulated. However, this is not necessarily a problem for information retrieval applications, because much of the speech material stored in digital databases is, in fact, acted (radio plays, audio clips from movies, etc.).

It is difficult to collect “genuine” emotional speech data. On the one hand, if the recording is made in a noisy and uncontrolled real-life situation, the stored speech signal is likely to be weak or distorted, which largely rules out a principled acoustic analysis of the data. On the other hand, ethical considerations are an important issue: is it morally acceptable, or even legal, to make a speaker genuinely angry, sad or scared for the purposes of collecting speech material? Certain previously existing speech data can be useful for research on the acoustic correlates of emotions. The radio news broadcast of the Hindenburg crash is perhaps the most famous publicly available emotional speech source.
Similarly, recordings made in the context of interactive radio programs can contain interesting emotion-laden speech data. The problem is, however, that the researcher cannot verify which emotion the speaker intended to produce. Furthermore, copyright restrictions often apply to such data: one cannot take for granted that the data can be accessed even for strictly limited research purposes. Indeed, most emotional speech databases are not available for distribution because of copyright restrictions.

The MediaTeam Emotional Speech Corpus contains some 90 minutes of systematically collected emotional speech. Fourteen actors, seven men and seven women (aged between 25 and 50), were recruited to simulate basic emotions. First, each speaker was asked to read out a Finnish passage in a neutral tone of voice. Second, the speaker was to summarize the semantic content of the passage in his/her own words (the text, dealing with kaarnikka or “crowberry”, was semantically as neutral as possible). Third, each speaker was asked to read out the text simulating the following emotions: happiness, sadness, fear, anger, boredom, and disgust. Fourth, the speakers were to interpret, in pairs, two colloquial dialogues containing predefined emotional lines (the instructions as to how to produce the lines were given in the manuscript). The material was digitally recorded with DAT in an anechoic studio to produce a 44.1-kHz, 16-bit CD-format recording. The database currently contains only Finnish speech, but it will be complemented with an English speech corpus. The database is currently the largest corpus of emotional speech for Finnish, and the linguistic units associated with specific emotional overtones range from short exclamations to monologues of approximately 30 seconds. The database contains speech reflecting the basic emotions, which are considered to include at least happiness, sadness, anger and fear (Izard, 1977).

Information retrieval and prosodic information

The main objectives of this research are twofold. First, the different portrayals are subjected to digital acoustic analysis to obtain profiles of vocal parameters for different emotions. The set of parameters automatically computed from the speech signal includes the following features (a sketch of how a subset of them can be computed from the signal is given below):

- maximum F0
- minimum F0
- range between maximum F0 and minimum F0
- 5% percentile as the lower limit of the F0 range
- 95% percentile as the upper limit of the F0 range
- range between the 5% percentile and the 95% percentile
- global jitter (%)
- global shimmer (%)
- amount of low-frequency energy (<1,000 Hz) in the spectrum
- F0 mean
- F0 median
- F0 mode
- F0 variance
- speaking rate (syllables/min.)
- articulation rate (syllables/sec.)
- average range of F0 rise (ST)
- average range of F0 fall (ST)
- average rate of F0 rise (ST/sec.)
- average rate of F0 fall (ST/sec.)
- maximum range of F0 rise (ST)
- maximum range of F0 fall (ST)
- maximum rate of F0 rise (ST/sec.)
- maximum rate of F0 fall (ST/sec.)
- mean intensity level (dB)
- maximum intensity level (dB)
- minimum intensity level (dB)
- variance of intensity

The acoustic analysis is carried out by means of analysis software written in MATLAB (a trademark of The MathWorks Inc., USA). Second, a panel of judges is asked to listen to the spoken passages reflecting the basic emotions: the listeners choose the label that best describes the emotional content of each recording. The responses of the listener-judges can thus be used to decide whether the produced stretches of speech reflect the intended emotions.
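The following sketch shows how a subset of the F0 and intensity parameters listed above can be computed from a recording. It is an illustration only, assuming the Python library librosa rather than the MATLAB software actually used in the study; the pYIN F0 range, the RMS-based intensity measure and its dB reference are illustrative choices, and jitter, shimmer, speaking rate and the F0 rise/fall measures are omitted because they require additional segmentation.

```python
import numpy as np
import librosa

def f0_and_intensity_features(wav_path):
    """Compute a subset of the F0 and intensity parameters listed above."""
    y, sr = librosa.load(wav_path, sr=None)

    # Frame-level F0 with the pYIN tracker; unvoiced frames come back as NaN.
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=500, sr=sr)
    f0v = f0[np.isfinite(f0)]

    # Frame-level intensity: RMS energy converted to dB (arbitrary reference).
    rms = librosa.feature.rms(y=y)[0]
    intensity_db = 20.0 * np.log10(rms + 1e-10)

    return {
        "f0_max_hz": float(np.max(f0v)),
        "f0_min_hz": float(np.min(f0v)),
        "f0_range_hz": float(np.max(f0v) - np.min(f0v)),
        "f0_p05_hz": float(np.percentile(f0v, 5)),
        "f0_p95_hz": float(np.percentile(f0v, 95)),
        "f0_p05_p95_range_hz": float(np.percentile(f0v, 95) - np.percentile(f0v, 5)),
        "f0_mean_hz": float(np.mean(f0v)),
        "f0_median_hz": float(np.median(f0v)),
        "f0_variance": float(np.var(f0v)),
        "intensity_mean_db": float(np.mean(intensity_db)),
        "intensity_max_db": float(np.max(intensity_db)),
        "intensity_min_db": float(np.min(intensity_db)),
        "intensity_variance": float(np.var(intensity_db)),
    }
```

One such feature vector per utterance is what the statistical analysis described next operates on.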
On the basis of the results of the acoustic measurements and the recognition scores, statistical procedures are used to determine how well the emotions can be differentiated on the basis of the vocal parameters measured. In the statistical analysis, carried out with SPSS (a trademark of SPSS Inc., USA), instead of looking for single correlations between specific emotions and specific acoustic parameters, the aim is to find out the relative weight of each acoustic parameter in the vocal signaling of each emotion. By means of multiple regression and discriminant analysis, it is possible to determine which acoustic parameters best differentiate between the emotional categories (i.e. it is possible to determine the weight of each acoustic feature in the complex of acoustic features accompanying each emotion). By combining traditional correlation analysis with discriminant analysis, one can find out both the “sensitivity” and the “specificity” of each acoustic feature in the signaling of emotional content. The preliminary results of the discriminant analysis indicate that the core of the basic emotions, i.e. sadness, happiness, anger and the neutral tone of voice, can be recognized very reliably (91.1% correct with the learning set and 82.1% correct with cross-validation). Although comparison between different studies of vocal correlates of emotions is often problematic (as the data used and the acoustic parameters measured differ), the recognition rate can be considered at least satisfactory: Banse & Scherer (1996) report a 50% classification level for 14 emotions.

Once the data is analyzed completely, the results can be applied to the development of content-based retrieval systems for digital audio libraries. Each acoustic/prosodic feature is connected with the retrieval system, and the user interface can be built on the principles of self-organizing maps: thus, when looking for “happy speech” in the audio database, the user can get search results listing audio files which are more or less likely to contain such speech. The candidate files can be displayed graphically, showing which emotional categories come closest to the files. The ultimate aim is to combine audio and video information in the retrieval system.
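To make the discriminant-analysis step concrete, the sketch below fits a linear discriminant classifier on per-utterance acoustic feature vectors (such as those produced by the extraction sketch above) and reports both learning-set and cross-validated accuracy. The scikit-learn toolchain, the feature matrix, and the emotion labels are assumptions made purely for illustration; the original analysis was carried out in SPSS, and the placeholder data here is random, so the printed figures mean nothing.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# X: one row of acoustic features per utterance, y: emotion labels.
# Filled with random placeholders only so that the snippet runs;
# real values would come from the corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = rng.choice(["sad", "happy", "angry", "neutral"], size=200)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Accuracy on the learning set (analogous to the 91.1% figure in the text).
train_acc = lda.score(X, y)

# Cross-validated accuracy (analogous to the 82.1% figure in the text).
cv_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=10).mean()

# The absolute coefficient magnitudes give a rough picture of the weight of
# each acoustic parameter in separating the emotion categories.
feature_weights = np.abs(lda.coef_).mean(axis=0)

print(f"learning-set accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```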
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007